Add new INT4 quantization features to model builder #940

kunal-vaishnavi · 2024-09-27T19:27:14Z

Description

This PR adds new INT4 quantization features to the model builder.

1. The model builder can now quantize the embedding layer and the language modeling head to INT4 precision by default.
2. For already-quantized PyTorch models that are passed to the model builder, any ops that are still created with `MatMul` can now be quantized to `MatMulNBits` via RTN.
3. A new optional flag in the extra options called `int4_op_types_to_quantize` has been added to allow more flexibility with INT4 quantization.
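
To illustrate what an op-type filter such as `int4_op_types_to_quantize` makes possible, here is a minimal sketch. It is not the code in `builder.py`: the helper `select_nodes_to_quantize`, the node names, and the assumption that the embedding lookup shows up as a `Gather` node are all hypothetical and only there for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical helper (not part of builder.py): pick the nodes whose op type
# is in the user-supplied set of op types to quantize.
def select_nodes_to_quantize(
    nodes: Iterable[Tuple[str, str]], op_types: Iterable[str]
) -> List[str]:
    wanted = set(op_types)
    # Keep graph order so the result is deterministic.
    return [name for name, op_type in nodes if op_type in wanted]

# Hypothetical graph: (node name, op type) pairs. The names are made up.
graph = [
    ("/model/embed_tokens/Gather", "Gather"),
    ("/model/layers.0/attn/qkv_proj/MatMulNBits", "MatMulNBits"),
    ("/lm_head/MatMul", "MatMul"),
]

only_matmul = select_nodes_to_quantize(graph, ["MatMul"])
matmul_and_gather = select_nodes_to_quantize(graph, ["MatMul", "Gather"])
```

With only `MatMul` selected, the remaining float `MatMul` for the LM head is picked up while ops that are already `MatMulNBits` are left alone. Adding `Gather` would also pick up the embedding lookup under the assumption stated above.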

Motivation and Context

With these PR changes, the size of the ONNX models can be reduced by quantizing the embedding layer and/or the language modeling head.

AutoAWQ is one example for ONNX models built from already-quantized PyTorch models. AutoAWQ does not quantize the language modeling head, so the resulting ONNX model typically contains a MatMul op for the language modeling head. Now, that MatMul op will be quantized via RTN to MatMulNBits to reduce memory.
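
For readers unfamiliar with RTN (round-to-nearest), the sketch below shows the general idea of block-wise 4-bit quantization on a flat list of weights. It is a generic, asymmetric RTN illustration in plain Python, not the quantizer the builder calls. The function names, the block size of 32, and the storage layout are assumptions made for this example and do not reflect how `MatMulNBits` packs its weights.

```python
from typing import List, Tuple

def rtn_quantize_int4(
    weights: List[float], block_size: int = 32
) -> Tuple[List[int], List[float], List[int]]:
    """Round-to-nearest quantization to unsigned 4-bit codes, one scale and
    zero point per block (illustrative only)."""
    codes: List[int] = []
    scales: List[float] = []
    zero_points: List[int] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Include 0.0 in the range so that zero is exactly representable.
        lo = min(min(block), 0.0)
        hi = max(max(block), 0.0)
        scale = (hi - lo) / 15 or 1.0  # 4 bits -> 16 levels (0..15)
        zero_point = min(15, max(0, round(-lo / scale)))
        scales.append(scale)
        zero_points.append(zero_point)
        for w in block:
            q = round(w / scale) + zero_point
            codes.append(min(15, max(0, q)))  # clamp to the 4-bit range
    return codes, scales, zero_points

def rtn_dequantize(
    codes: List[int], scales: List[float], zero_points: List[int],
    block_size: int = 32,
) -> List[float]:
    """Map 4-bit codes back to approximate float weights."""
    return [
        (q - zero_points[i // block_size]) * scales[i // block_size]
        for i, q in enumerate(codes)
    ]

weights = [i / 10 - 1.6 for i in range(64)]
codes, scales, zero_points = rtn_quantize_int4(weights)
restored = rtn_dequantize(codes, scales, zero_points)
```

Each weight is stored as a 4-bit code plus a small per-block overhead for the scale and zero point, which is where the memory reduction comes from. The reconstruction error per weight is bounded by about half of that block's scale.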


As the title says. This enables a further reduction in quantized model size and improved runtime efficiency, within an acceptable range of accuracy degradation. Orthogonal to #940. This PR targets already-quantized models in AutoAWQ/AutoGPTQ format that **have** the LM head quantized.


kunal-vaishnavi added 2 commits September 27, 2024 11:03

Quantize LM head to int4 for already-quantized models

c1d44e1

Quantize embeddings and LM head to int4 by default

0b3d492

kunal-vaishnavi requested a review from yufenglee September 27, 2024 19:27

Remove extra parenthesis

01164db

yufenglee reviewed Sep 30, 2024

src/python/py/models/builder.py (outdated)

kunal-vaishnavi added the 0.5.0 label Oct 28, 2024

kunal-vaishnavi added 2 commits October 30, 2024 20:59

Merge branch 'main' into kvaishnavi/int4-embeddings

3a6a58b

Change default for int4 embedding quantization

6a9ec28

BowenBao mentioned this pull request Oct 31, 2024

Extend builder support for quantized lm_head #1022

Merged

Update builder.py

eb905f7

fajin-corp approved these changes Oct 31, 2024

hanbitmyths reviewed Oct 31, 2024

src/python/py/models/builder.py

Prevent duplicate quantization for already-quantized, use-QDQ models

34667ee

hanbitmyths approved these changes Oct 31, 2024

Fix typo

0ab6c96

kunal-vaishnavi merged commit 2c01695 into main Nov 1, 2024
13 checks passed

kunal-vaishnavi deleted the kvaishnavi/int4-embeddings branch November 1, 2024 22:43

jambayk mentioned this pull request Nov 8, 2024

Add Quantized_model + float LoRA model scenario to model builder #1043

Open
